ABSTRACT

Recurrent neural networks (RNNs) have shown outstanding performance
on processing sequence data. However, they suffer from long training
time, which demands parallel implementations of the training
procedure. Parallelization of the training algorithms for RNNs is
very challenging because internal recurrent paths form dependencies
between two different time frames. In this paper, we first propose a
generalized graph-based RNN structure that covers the most popular
long short-term memory (LSTM) network. Then, we present a
parallelization approach that automatically explores parallelisms of
arbitrary RNNs by analyzing the graph structure. The experimental
results show that the proposed approach achieves a large speed-up
even with a single training stream, and further accelerates the
training when combined with multiple parallel training streams.

Index Terms: Recurrent neural network (RNN), long short-term memory
(LSTM), generalization, parallelization, graphics processing unit
(GPU)

1. INTRODUCTION 

Deep neural networks have shown impressive performance in several
pattern recognition applications [1, 2]. Among the deep neural
networks, the feed-forward networks are suitable for processing input
data with a fixed length, and they are usually used for image and
phoneme recognition. On the other hand, recurrent neural networks
(RNNs) employ internal feedback, and they are suitable for processing
input data whose dimension is not fixed or limited. For example,
automatic speech recognition (ASR) systems can perform better with an
RNN-based language model [3].

Since RNNs contain feedback loops, the past input can be memorized
and affect the current output. If RNNs are properly trained, it is
possible to compress the input history effectively and yield good
results even when there are considerable time delays between the
input and output. In particular, the long short-term memory (LSTM)
RNN is known to solve problems with long time lags very successfully
[4].

However, the LSTM RNN employs a very complex component known as the
memory block. Even a slight modification of the structure demands
much effort because of the difficulty in deriving the corresponding
training equations. Thus, a generalized RNN structure is needed that
can be modified easily while representing LSTM networks exactly.
Previously, a generalized LSTM-like RNN structure with real-time
recurrent learning (RTRL) [5] was proposed in [6] with special gated
connections. However, we propose a much more general structure by
introducing multiplicative layers and delayed connections. Also, we
derive a backpropagation through time (BPTT) [7] based training
algorithm for our RNN structure, which is generally more flexible
than the RTRL-based one.

RNNs also demand very long training time, thus implementation with
GPUs or multiprocessors is needed. However, parallelization of the
network is difficult due to the dependency induced by the internal
feedback loops. The conventional approach uses multiple independent
training streams that employ multiple copies of the network [8].
However, this inter-stream parallelism demands huge memory, which is
a serious bottleneck for GPU-based implementations.

In this paper, we propose a parallelization approach as well as the
generalized RNN structure. For this purpose, we first develop
training algorithms for the generalized RNNs. The training equations
of conventional LSTM can be represented exactly with the generalized
equations. Then, the parallelization approach exposes single-stream
parallelization (intra-stream parallelism), which, unlike the
conventional multi-stream parallelization (inter-stream parallelism),
does not increase the size of mini-batches. Experimental results show
that further speed-up can be achieved by combining the two
parallelisms.

This paper is organized as follows. The generalized LSTM-like RNN
structure is proposed and its training equations are derived in
Section 2. In Section 3, the intra-stream parallelism of the
generalized RNNs is explored and combined with the conventional
inter-stream parallelism. In Section 4, experimental results of the
proposed approach on a GPU are presented, followed by concluding
remarks in Section 5.

2. GENERALIZATION 

To apply our parallelization approach to various types of RNNs, we
first introduce a generalized RNN structure that can represent
complex RNNs using simple basic blocks. This generalization fully
covers advanced LSTM network structures with forget gates and
peephole connections, and their BPTT-based training algorithm. Also,
with the generalized RNN, one can easily design a new RNN structure
since every equation and the parallelization approach remain the
same.

2.1. Generalized RNN structure 

The proposed generalized RNN structure is basically a directed graph,
which consists of a set of nodes and edges. Each node represents a
layer and each edge makes a connection between two layers.



There are two types of connections: delayed and non-delayed. A
delayed connection applies a fixed amount of delay to the signal, and
is used to construct a recurrent loop. More specifically, the
connection $m$ propagates the activation of the source layer $k$ at
the frame $t - d_m$ to the destination layer at the frame $t$ as

$$z_m(t) = W_m y_k(t - d_m), \tag{1}$$

where $z_m$ is the output of the connection $m$, $W_m$ is the
corresponding weight matrix, $y_k$ is the activation of the source
layer $k$, and $d_m$ is the amount of delay at the connection $m$.
The value of $d_m$ is 0 for non-delayed connections and larger than 0
for delayed connections.

In an additive layer, the inputs are summed up and the activation
function is applied to the sum:

$$s_k(t) = \sum_{m \in A_k} z_m(t), \tag{2}$$

$$y_k(t) = f_k(s_k(t)), \tag{3}$$

where $s_k$ is the state (input), $A_k$ is the set of the indices of
the anterior connections, $y_k$ is the activation, and $f_k(\cdot)$
is the activation function of the layer $k$. In addition to the
normal additive layers, multiplicative layers are employed to
represent gate units of LSTM networks. A multiplicative layer
performs element-wise multiplication of input vectors (or matrices
for batched computation) as follows:

$$s_{k,i}(t) = \prod_{m \in A_k} z_{m,i}(t), \tag{4}$$

where the subscript $i$ represents the index of elements in a vector.
For generality, we introduce an aggregation function $g_k(\cdot)$ as

$$s_k(t) = g_k(\{z_m(t) \mid m \in A_k\}), \tag{5}$$

where $g_k(\cdot)$ is a vector addition function for an additive
layer or an element-wise multiplication function for a multiplicative
layer, or it can be another nonlinear function that adds further
nonlinearity to the network.
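As an illustration of (1) to (5), the following minimal Python sketch evaluates one connection and then an additive and a multiplicative layer for a single frame. The function names (`matvec`, `aggregate`, `layer_forward`), the plain-list vector representation, and the numeric values are assumptions made for this example only; a practical implementation would use batched matrix operations.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, y: Vector) -> Vector:
    """Connection output z_m(t) = W_m y_k(t - d_m), as in (1)."""
    return [sum(w * v for w, v in zip(row, y)) for row in W]


def aggregate(inputs: List[Vector], kind: str) -> Vector:
    """Aggregation g_k of (5): element-wise sum (2) or product (4)."""
    if kind not in ("additive", "multiplicative"):
        raise ValueError(f"unknown layer kind: {kind}")
    state = list(inputs[0])
    for z in inputs[1:]:
        if kind == "additive":
            state = [a + b for a, b in zip(state, z)]
        else:
            state = [a * b for a, b in zip(state, z)]
    return state


def layer_forward(
    inputs: List[Vector], kind: str, activation: Callable[[float], float]
) -> Tuple[Vector, Vector]:
    """Return (s_k(t), y_k(t)) with y_k(t) = f_k(s_k(t)), as in (3)."""
    state = aggregate(inputs, kind)
    return state, [activation(v) for v in state]


# Hypothetical values: y_prev plays the role of y_k(t - d_m).
y_prev = [1.0, -2.0]
W = [[0.5, 0.0], [1.0, 1.0]]
z1 = matvec(W, y_prev)   # output of a connection with a full weight matrix
z2 = [2.0, 3.0]          # output of a second anterior connection

s_add, y_add = layer_forward([z1, z2], "additive", math.tanh)
s_mul, y_mul = layer_forward([z1, z2], "multiplicative", lambda v: v)
```

Only the aggregation step differs between the two layer types, which is why a gate unit can be treated as an ordinary node of the graph.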

In the previous approach to generalized LSTMs [6], the gate units are
implemented with gated connections. However, a gated connection has
two input layers, so it cannot be regarded as an edge of a familiar
directed graph structure, where each edge has one input and one
output.

In our approach, by introducing the multiplicative layers, LSTM gates
can be regarded as normal nodes in a graph structure, which allows
general graph algorithms to be directly applied in Section 3. As an
example, Figure 1 shows a generalized representation of a
single-layer LSTM network with forget gates and peephole connections.

2.2. Training 

Fig. 1. Generalized representation of an LSTM network with forget
gates and peephole connections. Thick arrows represent connections
with full weight matrices, whereas thin arrows represent connections
with identity weight matrices. The numbers on the dashed lines
indicate the corresponding delay amounts. A non-singleton strongly
connected component (SCC) is drawn, whose nodes will be grouped into
a single recurrent node to make the network acyclic.

In this section, BPTT [7] based training equations for the
generalized RNN are derived. The objective is to minimize the
following total error from $t_0 + 1$ to $t_1$:

$$E^{\mathrm{total}}(t_0, t_1) = \sum_{t = t_0 + 1}^{t_1} E(t), \tag{6}$$

where $E(t)$ is the error at frame $t$. For convenience, we define
two derivative variables as

$$\delta_{k,i}(t) = -\frac{\partial E^{\mathrm{total}}(t_0, t_1)}{\partial s_{k,i}(t)}, \tag{7}$$

$$\epsilon_{m,i}(t) = -\frac{\partial E^{\mathrm{total}}(t_0, t_1)}{\partial z_{m,i}(t)}. \tag{8}$$

These two variables are back-propagated in the backward pass. If the
layer $k$ is an output layer, $\delta_{k,j}(t)$ should be initialized
by comparing the output with a desired output $d_{k,j}(t)$ according
to the error criterion defined by $E(t)$ and the activation function
of the output layer. Using the minimum cross-entropy criterion with
the softmax activation function,

$$\delta_{k,j}(t) = d_{k,j}(t) - y_{k,j}(t). \tag{9}$$

If the layer $k$ is not an output layer,

$$\delta_{k,j}(t) = -\sum_{n \in P_k} \sum_{i \in I_n} \frac{\partial E^{\mathrm{total}}(t_0, t_1)}{\partial z_{n,i}(t + d_n)} \frac{\partial z_{n,i}(t + d_n)}{\partial y_{k,j}(t)} \frac{\partial y_{k,j}(t)}{\partial s_{k,j}(t)} \tag{10}$$

$$= \sum_{n \in P_k} \sum_{i \in I_n} \epsilon_{n,i}(t + d_n) W_{n,ij} f'_k(s_{k,j}(t)), \tag{11}$$

where $P_k$ is the set of posterior connection indices of the layer
$k$ and $I_n$ is the set of element indices of the vector $z_n$.
Also, $\epsilon_{m,j}(t)$ becomes

$$\epsilon_{m,j}(t) = -\frac{\partial E^{\mathrm{total}}(t_0, t_1)}{\partial s_{k,j}(t)} \frac{\partial s_{k,j}(t)}{\partial z_{m,j}(t)} \tag{12}$$

$$= \delta_{k,j}(t) \frac{\partial s_{k,j}(t)}{\partial z_{m,j}(t)}, \tag{13}$$

where $k$ is the index of the destination layer of the connection
$m$. To truncate errors at $t = t_g$, we back-propagate the two
derivative variables while $t > t_g$, where $t_g < t_0$, using (11)
and (13). After the backward pass, the truncated error gradient of
the connection $m \in P_k$ can be acquired by

$$\frac{\partial E^{\mathrm{total}}(t_0, t_1)}{\partial W_{m,ij}} = \sum_{t} \frac{\partial E^{\mathrm{total}}(t_0, t_1)}{\partial z_{m,i}(t)} \frac{\partial z_{m,i}(t)}{\partial W_{m,ij}} \tag{14}$$

$$= -\sum_{t} \epsilon_{m,i}(t) y_{k,j}(t - d_m), \tag{15}$$

where the sums run over the frames covered by the backward pass.

In matrix form, (11) can be represented as

$$\delta_k(t) = \left( \sum_{n \in P_k} W_n^T \epsilon_n(t + d_n) \right) \circ f'_k(s_k(t)), \tag{16}$$

where $\circ$ denotes element-wise vector multiplication. If the
layer $k$ is an additive layer, then (13) becomes

$$\epsilon_m(t) = \delta_k(t). \tag{17}$$

Otherwise, for the multiplicative layer $k$,

$$\epsilon_m(t) = \delta_k(t) \circ \prod_{n \in A_k, n \neq m} z_n(t), \tag{18}$$

where the product $\prod$ is taken element-wise. The error gradient
matrix for the connection $m \in P_k$ is computed by

$$\nabla W_m = -\sum_{t} \epsilon_m(t) y_k^T(t - d_m). \tag{19}$$

The error gradients can be used for first-order optimization methods
such as stochastic gradient descent.

Fig. 2. Feed-forward representation of the LSTM network that is
depicted in Figure 1.
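The backward rule for a multiplicative layer, in which the derivative with respect to one input is the layer's $\delta$ multiplied element-wise by all the other inputs, can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a squared error on a tanh activation instead of the cross-entropy/softmax criterion of the paper, the layer is treated as the output layer, and all names and values are hypothetical. The finite-difference check uses the sign convention of the derivation, where both derivative variables are negative gradients of the error.

```python
import math
from typing import List, Tuple

Vector = List[float]


def state(zs: List[Vector]) -> Vector:
    """Multiplicative aggregation: s_i = prod over inputs m of z_{m,i}."""
    return [math.prod(column) for column in zip(*zs)]


def error(zs: List[Vector], target: Vector) -> float:
    """Assumed criterion for this sketch: E = 0.5 * sum_i (tanh(s_i) - d_i)^2."""
    return 0.5 * sum((math.tanh(s) - d) ** 2 for s, d in zip(state(zs), target))


def backward_multiplicative(
    zs: List[Vector], target: Vector
) -> Tuple[Vector, List[Vector]]:
    """Return delta = -dE/ds and, per input m, eps_m = delta * prod_{n != m} z_n."""
    ys = [math.tanh(s) for s in state(zs)]
    delta = [-(y - d) * (1.0 - y * y) for y, d in zip(ys, target)]
    eps = []
    for m in range(len(zs)):
        others = [
            math.prod(z[i] for n, z in enumerate(zs) if n != m)
            for i in range(len(delta))
        ]
        eps.append([d * o for d, o in zip(delta, others)])
    return delta, eps


def numeric_eps(zs: List[Vector], target: Vector, m: int, i: int,
                step: float = 1e-6) -> float:
    """Central finite difference of -dE/dz_{m,i}."""
    plus = [list(z) for z in zs]
    minus = [list(z) for z in zs]
    plus[m][i] += step
    minus[m][i] -= step
    return -(error(plus, target) - error(minus, target)) / (2.0 * step)


# Three hypothetical input vectors of a multiplicative layer and a target.
zs = [[0.5, -1.2], [0.8, 0.3], [-0.4, 1.5]]
target = [0.2, -0.1]
delta, eps = backward_multiplicative(zs, target)
max_gap = max(
    abs(eps[m][i] - numeric_eps(zs, target, m, i))
    for m in range(len(zs))
    for i in range(len(target))
)
```

For an additive layer the same check reduces to comparing each `eps[m]` with `delta` itself, since every input contributes to the state with unit derivative.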

3. PARALLELIZATION 

Parallelization of RNN computation is quite challenging due to
dependencies between two consecutive frames. The state of an RNN at
the frame $k$ cannot be determined until the computation for the
frame $k - 1$ is finished. In this section, we first develop a
parallelization method for the forward and the backward pass with a
single stream (intra-stream parallelism), and then extend the
approach to a multi-stream case (inter-stream parallelism).

3.1. Intra-stream parallelism 

The key concept of separating sequential parts from the parallel
parts of an RNN is to determine loops in the RNN and group each loop
into a single special node called a recurrent node. Then, the
remaining structure becomes a directed acyclic graph (DAG), which can
be easily parallelized as in a mini-batch based feed-forward neural
network computation. Only the internal computations of the recurrent
nodes are performed sequentially.

More specifically, strongly connected components (SCCs) are found to
determine which nodes should be grouped into a recurrent node. An SCC
is a subgraph that is strongly connected, that is, there is a
directed path in each direction between every pair of vertices inside
the subgraph. An SCC analysis finds a set of SCCs that form a
partition of the vertex set of the original graph. For SCCs that are
singletons and do not contain a self-loop, the original nodes inside
the SCCs remain unchanged. Otherwise, the nodes in each SCC are
grouped into a single recurrent node. Then, the final graph becomes a
DAG and is ready for parallel computation. An example of an LSTM
network is shown in Figure 2. One of the well-known algorithms for
finding SCCs is Tarjan's strongly connected components algorithm [9].
Tarjan's algorithm also provides a reverse topological sort of the
resulting DAG, which is useful to determine the activation order.
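The grouping step can be sketched with a textbook recursive version of Tarjan's algorithm. The layer graph below is a simplified, hypothetical LSTM-like graph (node names such as `gated_mem` are invented for this example and are not taken from Figure 1), and delayed and non-delayed connections are both treated as plain edges, which is all the SCC analysis needs.

```python
from typing import Dict, List


def tarjan_scc(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Return the SCCs of `graph` in reverse topological order."""
    index: Dict[str, int] = {}
    lowlink: Dict[str, int] = {}
    stack: List[str] = []
    on_stack = set()
    sccs: List[List[str]] = []

    def visit(v: str) -> None:
        index[v] = lowlink[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:  # v is the root of an SCC
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            sccs.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


def is_recurrent(component: List[str], graph: Dict[str, List[str]]) -> bool:
    """Non-singleton SCCs and singletons with a self-loop become recurrent nodes."""
    return len(component) > 1 or component[0] in graph.get(component[0], [])


# Hypothetical LSTM-like layer graph; loops go through the memory cell.
graph = {
    "input": ["in_gate", "forget_gate", "cell_in", "out_gate"],
    "in_gate": ["gated_in"],
    "cell_in": ["gated_in"],
    "gated_in": ["cell"],
    "forget_gate": ["gated_mem"],
    "gated_mem": ["cell"],
    "cell": ["gated_mem", "in_gate", "forget_gate", "out_gate", "cell_out"],
    "cell_out": ["gated_out"],
    "out_gate": ["gated_out"],
    "gated_out": ["output"],
    "output": [],
}

sccs = tarjan_scc(graph)
recurrent_nodes = [sorted(c) for c in sccs if is_recurrent(c, graph)]
activation_order = list(reversed(sccs))  # topological order for the forward pass
```

In this example the cell, the input and forget gates, and the two multiplicative nodes feeding the cell fall into one recurrent node, while the remaining layers stay as ordinary DAG nodes. The recursive form is adequate for graphs with a handful of layers; a very deep graph would call for an iterative variant.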

Once an RNN is represented as a DAG, the forward computation becomes
very similar to that of feedforward networks. As in the case of
feedforward networks, computations of nodes and edges are performed
in a topological order of the DAG. These operations can be done in
parallel over several frames since the network is represented as a
DAG and there are no dependencies between different frames except
inside the isolated recurrent nodes.

Recurrent nodes are subgraphs of the original RNN and should be
computed sequentially. The computation of a recurrent node from frame
$t_0$ to $t_1$ in the forward pass requires $t_1 - t_0 + 1$
sequential steps. In each step of the forward pass, delayed
connections are computed first. Then the remaining part excluding the
delayed connections becomes a DAG and can be computed in a
topological order. The computation of a backward pass can be
performed similarly with reversed topological orders.
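The split between parallel and sequential work can be illustrated with a scalar Elman-like network, sketched below under simplifying assumptions: one input unit, one hidden unit with a self-loop of delay 1, one linear output unit, and hypothetical weights. The list comprehensions stand in for the batched operations over all frames that a GPU implementation would issue; only the loop over the recurrent node is inherently sequential.

```python
import math
from typing import List


def forward_frame_by_frame(x: List[float], w_in: float, w_rec: float,
                           w_out: float) -> List[float]:
    """Baseline: every layer is evaluated one frame at a time."""
    h_prev = 0.0
    outputs = []
    for x_t in x:
        h_prev = math.tanh(w_in * x_t + w_rec * h_prev)
        outputs.append(w_out * h_prev)
    return outputs


def forward_intra_stream(x: List[float], w_in: float, w_rec: float,
                         w_out: float) -> List[float]:
    """Same network, with the DAG parts evaluated for all frames at once."""
    # Parallel part: the non-delayed input connection has no dependency
    # between frames, so all frames can be processed as one batch.
    z_in = [w_in * x_t for x_t in x]

    # Sequential part: the recurrent node needs len(x) steps, because the
    # delayed connection at frame t reads the activation of frame t - 1.
    hidden = []
    h_prev = 0.0
    for z_t in z_in:
        h_prev = math.tanh(z_t + w_rec * h_prev)
        hidden.append(h_prev)

    # Parallel part: the output connection is again independent per frame.
    return [w_out * h_t for h_t in hidden]


# Hypothetical input sequence and weights.
x = [0.1, -0.4, 0.7, 0.2, -0.9]
baseline = forward_frame_by_frame(x, w_in=0.8, w_rec=-0.5, w_out=1.5)
regrouped = forward_intra_stream(x, w_in=0.8, w_rec=-0.5, w_out=1.5)
```

Both functions perform the same arithmetic in the same order per frame, so the results agree exactly; the regrouping changes only which operations can be batched.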

The sequential computations of recurrent nodes are quite expensive
and often become a bottleneck of the overall performance. To speed up
these sequential parts, we need to employ the multi-stream
parallelization.

3.2. Inter-stream parallelism 

Inter-stream parallelism can be explored in the multi-stream mode
where an RNN processes $N$ streams with independent contexts. This is
equivalent to running $N$ independent copies of the RNN. Therefore,
the multi-stream mode greatly increases parallelism and the overall
execution speed. Recently, this approach was successfully applied to
speed up language model training with an Elman network on a GPU [8].

For training an RNN in the multi-stream mode, the input and target
streams are usually given by concatenating randomly ordered training
sequences. Since the training sequences are very long, we apply the
efficient version of truncated BPTT($h$), denoted as BPTT($h$; $h'$),
proposed in [10]. BPTT($h$; $h'$) is similar to the ordinary
truncated BPTT($h$) in that the network is unrolled $h$ times.
However, in the forward pass of BPTT($h$; $h'$), $h'$ time steps are
computed at once. Also, the error gradients for the most recent $h'$
output errors are obtained in one iteration. These error gradients
are summed up over the $N$ training streams. Therefore, output errors
of a total of $N \times h'$ frames affect the error gradients when
updating weights after backward passes. We call the set of these
frames a mini-batch throughout the paper, as it is equivalent to a
mini-batch in stochastic gradient descent methods of feedforward
neural networks.

Fig. 3. Comparison of language model training speeds with Elman and
LSTM networks. The LSTM employs forget gates and peephole
connections. The sizes of the input layer, hidden or LSTM layer, and
output layer are 38,000, 512, and 20,000, respectively. The
mini-batch size is fixed to 1,024, so the error propagates from
1,024/$N$ to 2,048/$N$ - 1 previous steps, where $N$ is the number of
streams.

Increasing $N$ also speeds up the training. However, we cannot make
$N$ very large since the size of a mini-batch, $N \times h'$, is
limited by the physical memory size of a GPU. Moreover, increasing
the size of a mini-batch results in infrequent updates of the weights
and may slow down the convergence [11]. Also, the parameter $h'$
cannot be easily modified since the training speed is approximately
proportional to the ratio of $h'$ to $h$. For simplicity, let us
assume $h = 2h'$ to fix the training speed. In this case, the error
propagates through $h'$ to $2h' - 1$ previous time steps in the
backward pass. Therefore $h'$ should be set sufficiently large to
solve long time lag problems.
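The trade-off between the number of streams and the reach of the back-propagated error is simple arithmetic, sketched below for a fixed mini-batch size with $h = 2h'$. The helper name `bptt_window` and its return format are assumptions for this example; the mini-batch size of 1,024 is the one used in the experiments.

```python
from typing import Dict


def bptt_window(mini_batch_size: int, num_streams: int) -> Dict[str, int]:
    """Derive h', h = 2h', and the error propagation range for N streams."""
    if num_streams <= 0 or mini_batch_size % num_streams != 0:
        raise ValueError("mini_batch_size must be a positive multiple of num_streams")
    h_prime = mini_batch_size // num_streams   # frames per stream in a mini-batch
    return {
        "h_prime": h_prime,
        "h": 2 * h_prime,                      # unrolling depth
        "min_steps": h_prime,                  # reach of the oldest of the h' errors
        "max_steps": 2 * h_prime - 1,          # reach of the newest error
    }


single_stream = bptt_window(1024, 1)
sixty_four_streams = bptt_window(1024, 64)
max_streams = bptt_window(1024, 1024)
```

With 64 streams the error reaches only 16 to 31 previous time steps, whereas a single stream reaches 1,024 to 2,047, which is why needing fewer streams for a given speed leaves more room for learning long time lags.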

4. EXPERIMENTAL RESULTS 

An Nvidia Tesla K40 GPU is used for the following experiments. For
all experiments, BPTT($2h$; $h$) is used for simplicity. Since the
training algorithm for the generalized RNN structure is
mathematically equivalent to that of Elman or LSTM networks, results
with performance measures such as accuracy or the mean squared error
(MSE) are not reported.

To evaluate the proposed parallelization approach, we measure the
language model training speed in the multi-stream mode as in [8]. The
RNN architecture is an Elman network with 38,000 input, 512 hidden,
and 20,000 output units. The mini-batch size is fixed to 1,024 to use
the same amount of GPU memory. Hence, with $N$ streams, $h$ =
1,024/$N$ and the error propagates from 1,024/$N$ to 2,048/$N$ - 1
previous time steps. For comparison, an LSTM version of the network
with forget gates and peephole connections is also evaluated. Note
that the LSTM network has no self-recurrent connection from the
output of the LSTM layer to its input.

Fig. 4. Comparison of GPU processing power utilization when training
LSTM networks with three different sizes of LSTM layers: 1,024,
2,048, and 4,096. The input and output layers have the same size as
the LSTM layer. Also, the theoretical peak performance of the Tesla
K40 GPU is shown. The mini-batch size is fixed to 1,024.

The training speeds are compared in Figure 3 with a varying number of
streams. Since the baseline approaches do not exploit intra-stream
parallelism, they show poor training speeds when the number of
streams is small. On the other hand, the proposed approach employs
intra-stream parallelism and shows a speed-up of over 10 times over
the baseline approach when a single stream is used. Also, with the
proposed approach, we can obtain almost the maximum speed with only
64 streams. This is an advantage since using fewer streams allows
RNNs to learn longer time lags when the size of the mini-batch is
limited, as discussed in Section 3.2.

To analyze scalability and GPU efficiency with various sizes of
networks, we perform another experiment with LSTM networks with
forget gates and peephole connections. All layers of each network
have the same size, which is 1,024, 2,048, or 4,096. To examine the
GPU utilization, we present the number of single-precision floating
point operations per second (FLOPS) in Figure 4 along with the
theoretical peak performance of the Tesla K40 GPU. Note that only the
operations for parameters and error gradients are counted. Compared
to the previous experiment where the input and output layers are very
large, this example is much closer to deep RNN architectures in terms
of the ratio of the sequential computations (inside the recurrent
nodes) to the parallel computations. As shown in the figure, the GPU
utilization gets higher as the layer size or the number of streams
increases. Also, the intra-stream parallelism further accelerates the
training, especially with a small number of streams.

5. CONCLUDING REMARKS 

We introduced a generalized structure for RNNs which covers LSTM
networks with forget gates and peephole connections. This generalized
structure is represented as a directed graph where nodes and edges
correspond to layers and connections, respectively. Due to the graph
representation, we can automatically find loops inside RNNs using
Tarjan's strongly connected components algorithm and explore
intra-stream parallelism. The proposed intra-stream parallelism is
combined with inter-stream parallelism in multi-stream mode for
further acceleration. The experiments show that exploiting these two
parallelisms greatly speeds up the training task on a GPU.